A Soft Label Method for Medical Image Segmentation with Multirater Annotations

In medical image analysis, collecting multiple annotations from different clinical raters is a typical practice to mitigate possible diagnostic errors. For such multirater labels' learning problems, in addition to majority voting, it is a common practice to use soft labels in the form of full-probability distributions obtained by averaging raters as ground truth to train the model, which benefits from uncertainty contained in soft labels. However, the potential information contained in soft labels is rarely studied, which may be the key to improving the performance of medical image segmentation with multirater annotations. In this work, we aim to improve soft label methods by leveraging interpretable information from multiraters. Considering that mis-segmentation occurs in areas with weak supervision of annotations and high difficulty of images, we propose to reduce the reliance on local uncertain soft labels and increase the focus on image features. Therefore, we introduce local self-ensembling learning with consistency regularization, forcing the model to concentrate more on features rather than annotations, especially in regions with high uncertainty measured by the pixelwise interclass variance. Furthermore, we utilize a label smoothing technique to flatten each rater's annotation, alleviating overconfidence of structural edges in annotations. Without introducing additional parameters, our method improves the accuracy of the soft label baseline by 4.2% and 2.7% on a synthetic dataset and a fundus dataset, respectively. In addition, quantitative comparisons show that our method consistently outperforms existing multirater strategies as well as state-of-the-art methods. This work provides a simple yet effective solution for the widespread multirater label segmentation problems in clinical diagnosis.


Introduction
Recently, deep learning techniques have made impressive progress on image segmentation tasks and have become a popular choice in the computer vision community [1]. Typically, supervised learning in deep learning is based on the assumption that there is a ground truth (GT). However, the truth is a lie; that is, there is often a lack of human consensus on the category of an object [2][3][4]. Especially, in medical image segmentation, which is based on knowledge and experience, disagreements between raters are fairly common [5,6]. Inter-rater variability, as frequently reported by relevant research in the clinical feld, usually leads to difculties in segmenting areas of high uncertainty [7,8].
To mitigate this inter-rater variability, the most basic yet common approach is the majority voting approach, in which opinions agreed by a majority of raters are taken as true. However, the majority voting approach essentially discards the rich information contained in the multirater labels through one-hot operation (e.g., the probability distribution [0.6, 0.3, and 0.1] is transformed into a hard label [1, 0, and 0]). To combat this issue, soft-label methods that average rater annotations have been intensively investigated [9,10]. Furthermore, Islam and Glocker [11] introduced a label smoothing method that incorporates fuzzy information about edges into multirater soft labels, called spatially varied label smoothing (SVLS).
However, when we applied the soft labels method to the multirater optic cup (OC) and optic disc (OD) segmentation of the fundus image task, fnding that the areas where segmentation errors occur coincides with the highly divergent areas to some extent, see Figure 1. As demonstrated in Figure 2, the pixelwise loss and error rates of predictions are statistically positively correlated with the interclass variance which indicates the divergence between raters. We tentatively attempt to explain this phenomenon as follows: (i) An explanation is that the divergence area is highly uncertain, and the higher the uncertainty of the label, the lower the penalty imposed on the predicted distribution [12]. Highly uncertain annotations make it difcult to impose strong and precise constraints on the model, which is similar to weakly supervised learning that lacks accurate annotations [13]. Te dependence on annotations in weakly supervised learning is weakened and replaced by a focus on features [14]. (ii) Furthermore, we provide an intuitive interpretation that is more consistent with the multirater labels segmentation task: uncertainty refects pixelwise image difculty, where areas with high difculty are more challenging for the model to accurately segment. Image difculty, which is related to the visual characteristics of the images, such as image quality and occlusion of the area of the lesions, is one of the causal factors of inter-rater variability [15]. As demonstrated in Figure 1(a), the blood vessels occluding the edge region of the OD not only make ophthalmologists' judgment difcult but also hinder the accurate prediction of the deep neural network. In contrast to existing methods [16,17] that treat difculty as image level, we innovatively consider difculty to be pixelwise for segmentation tasks.
In conclusion, the regions with high inter-rater variability have more difcult features but only weaker supervision, which could be a cause of mis-segmentation. In this work, we aim to improve the performance level of the soft label approach on multirater labels' segmentation task based on the previously mentioned explanations. A way to get the best of both sides is to increase the focus on image features while reducing the reliance on highly uncertain annotations. Consequently, we propose a supervised segmentation network that is constrained by consistency regularization. Specifcally, consistency regularization exploits the augmentation invariance of images to optimize the feature space while avoiding relying simply on labels and compensating for the disadvantage of unreliable local annotations. Te uncertainty as the prior knowledge is formulated as the soft labels' interclass variance, which drives the proposed model's local diference training. In addition, the SVLS approach, which incorporates edge fuzziness into soft labels, is used to soften average expert labels.
Experiments are performed on a synthetic dataset with great disagreement as well as a real-world dataset. In these experiments, our method consistently outperforms existing multirater strategies and state-of-the-art (SOTA) methods. To verify the generalization of the proposed method, we additionally conduct generalization experiments on two other types of datasets.
In summary, the main contributions of this study are as follows: (1) To embed consistency/inconsistency of multirater into the model, the soft labels obtained by averaging softened annotations of raters are used as GT. (2) We provide thinking that disagreement among multiple raters, i.e., uncertainty, can be quantifed from soft labels and used as prior knowledge to refect the pixel-level difculty of an image. (3) We propose to use consistency regularization to improve the model's attention to features and reduce dependence on GT, especially in regions of high uncertainty. Without introducing additional parameters, the accuracy of our method is improved over that of other methods on synthetic and realworld datasets.

Related Works
Te problem of multirater labels' segmentation caused by inter-rater variability has started to pique the interest of researchers. Tere is a study showing that the observed labels depend on three causal factors: the true label, the expertise of the rater, and the image difculty [16]. For the method of obtaining the true label, it is a common practice to use majority voting [18] and STAPLE [19] or other label fusion strategies to obtain the ground-truth labels [11,20] so that they can be adapted to the general segmentation model. However, simple label fusion methods neither do take advantage of any image features nor do they carry the inter- 2 Computational Intelligence and Neuroscience rater variability through to the model. Recently, several eforts have started to explore the expertise of the rater using label sampling strategies [21] or rater modeling strategies [22]. For instance, Zhang et al. [23] proposed to use confusion matrices to the model preference of annotators, obtaining segmentation prediction with the least noise by optimizing two coupled convolutional neural networks (CNNs). Yu et al. [15] proposed a multibranch model for the multirater glaucoma classifcation task, encouraging the specifcity branch and the sensitivity branch to generate consistent/opposing predictions for consensus/disagreement samples. Ji et al. [24] proposed MRNet, which embeds the expertise of individual annotators into the model to generate calibrated predictions under diferent expertise levels for medical image segmentation. However, there still lacks efective research on the image difculty represented by image features in the multirater label segmentation task. Furthermore, we consider that multirater labels' segmentation is weakly supervised learning with inaccurate labels, which has not been explored before. Although our approach is uncertainty-driven, unlike works, such as Monte Carlo dropout [25] and ensembles [26,27], that evaluate uncertainty and produce multiple segmentation hypotheses, our work aims to learn a deterministic single-output deep model.

Methodology
Te main architecture of our model is illustrated in Figure 3, which is composed of three main parts: (a) segmentation network with consistency regularization for conveying more information about the input; (b) asymmetrical regularization part for generating uncertainty mask to realize local selfensembling in supervised learning; (c) multirater labels fusion part for obtaining a soft label for each input as the supervised target containing uncertainty. In the test and application phase, just the trained network is required to predict the segmentation of the input image.

Problem Defnition.
In this article, we consider the problem of learning a segmentation model from labels annotated by multiple human raters. Given the images X W×H×L � x n N n�1 and the corresponding one-hot labels Y W×H×C � y (r) n r�1,···,R n�1,···,N (W, H, L, C denote the width, height, channels, and classes), where N is the number of samples and R is the number of raters, each image is independently annotated by raters based on their personal experiences. Te objective of the multirater label segmentation task is to learn the projection function F(·), mapping the input image x n to the estimated prediction y n which is one-hot form encoded by the full probability distribution p n . In our article, p n is encouraged to be as similar as p n , which is the soft label fused by Y n .

Soft Labels.
Recently, increasing studies have proposed training a model using soft labels for accounting for the high uncertainty in lesion or structure borders' delineation [11,[28][29][30][31]. Averaging multirater labels is an intuitive way to obtain soft labels in multirater annotation tasks as follows: Although the average strategy incorporates uncertainty from inter-rater variability into soft labels, it indulges the overconfdence of each rater. Terefore, we soften each hard label by SVLS and then average them to obtain p n which contains spatial and inter-rater uncertainty as follows: where (i, j) is the position of pixel and w is a weight matrix SVLS determines the probability of the target pixel based on its neighboring pixels, achieved by a Gaussian-like weight matrix that is applied across the one-hot encoded rater labels y (r) to obtain a soft probability distribution.
Te transmission of uncertainty information into the model is inseparable from the appropriate loss function. Tere is the performance of several common loss functions in Section 4.4, including soft cross-entropy loss, soft dice loss, and soft focal loss. By comparison, the soft cross-entropy loss is selected as the optimization objective, encouraging the probability distribution of prediction p n to be identical to that of the soft label as follows:

Label Uncertainty Measure.
It is equally crucial to model uncertainty at the pixel-level as to improve the model's performance, particularly in medical scenarios [25]. Unlike work that uses stochastic networks [32,33] to model uncertainty, we improve multirater models using uncertainty as a source of prior knowledge. Specifcally, we consider the pixelwise interclass variance Var W×H n n�1,···,N that refects the uncertainty caused by inter-rater variability and spatial variation. It is inversely proportional to entropy, meaning that the lower the variance, the greater the entropy and, hence, the greater the uncertainty. Te appropriate uncertainty can enhance the generalization and calibration of the model. However, the high uncertainty would be detrimental to the model as noise. In the position of the (i, j) th pixel, the variance between classes Var(p (i,j) n ) can be formulated by the following: n (c) � 1. We propose to use uncertainty as a threshold to assign diferent optimization objectives to diferent areas of the image. Specifcally, the labels of areas with high uncertainty are no longer decisive but are replaced with constraints on the feature space, which will be clarifed in the next section. In areas with high rater agreement, soft labels and feature constraints work together to optimize the model. For convenience, we refer to this uncertainty-driven local diferential optimization as asymmetrical regularization. Te threshold comes into play in the form of a mask of 0-1, acting directly on the loss function. Mask is diferentiated into mask (GT) and mask (CR) based on the threshold, which correspond to areas of low and high uncertainty, respectively, as follows: where 1 is the indicator function and I is the identity matrix with the same shape as Var n .

Consistency Regularization.
To improve attention to features and optimize feature space, we propose using consistency regularization as an extra constraint on the model, which has been utilized in semisupervised learning [34] and unsupervised learning [35]. Consistency regularization is a type of self-ensemble learning because it only relies on the images themselves to learn. Inspired by Li et al. [34], we apply rotation consistency to this work. Specifcally, there is a problem in the segmentation task using CNN: when the inputs of CNN are rotated, the corresponding network predictions would not be rotated in the same way [36] as follows: where π k is a rotation to the image (i.e., horizontal, vertical, or mixed fip) and θ is the parameters of the network. Te feature space is automatically optimized when the model is encouraged to make the same judgments about elements before and after rotation. In this article, we use the soft crossentropy loss function as the optimization target of rotation consistency regularization term: θ π k x n log π k θ x n .
When formulas (3) and (7) are combined, the total loss function is as follows: θ π k x n log π k p n , (8) where λ 1 + λ 2 � 1. By minimizing the loss function, the network is urged to focus more on the image content than on the regression of GT alone [37]. So far, image features can be fully expressed through self-ensembling. In regions of divergence where uncertainty is high, supervision of labels is entirely replaced by unsupervised self-ensembling. Without introducing extra parameters and structures, asymmetrical regularization is achieved by covering the soft label with an uncertainty-based mask (formulas (5) and (7)). Finally, updated formula (8) is shown as follows: θ π k x n log π k p n .

Experiments
In this section, we introduce the experimental dataset, implementation details, and evaluation metrics. In order to explore the best performance under diferent combinations of the loss and uncertainty threshold value, we conduct quantitative experiments with diferent setups on the MNIST and the RIGA validation set in Section 4.4. For comparison with other methods, the common label fusion approach and other SOTA approaches for multirater labels segmentation are used as the benchmark. Te results are listed in Section 4.5, showing that our method can exploit the uncertainty of multirater annotations to improve segmentation performance. Additionally, ablation experiments are conducted to evaluate the efcacy of each component of our method.

Datasets
(i) MNIST is a handwritten digits dataset with 60,000 training and 10,000 test examples. All images are 28 × 28 grayscale versions of the handwritten numbers 0-9. Zhang et al. [23] synthesized a dedicated dataset of multirater annotation tasks based on MNIST, which simulates raters with different biases to obtain multiple labels by using Morpho-MNIST software [38]. Specifcally, the frst rater provides good segmentation with approximate GT, the second rater tends to oversegment, the third rater tends to undersegmentation, the fourth rater is prone to the combination of small fractures and oversegmentation, and the ffth rater always annotates everything as the background. We train a model using all fve raters' annotations and fnally test the model performance on GT. (ii) RIGA is a publicly available dataset for joint OC and OD segmentation from the University of Michigan [39].  [40], which are specifcally designed to evaluate interrater variability. Te QUBIQ-Kidney images are 2D CT slices (20 cases for training and 4 cases for testing) in which the kidneys are manually annotated by three raters. Te QUBIQ-Prostate images are 2D MRI slices (48 cases for training and 7 cases for testing) in which the prostate is manually annotated by six raters. To match the task objective of the QUBIQ challenge, GT and prediction are binarized at fve probability levels (0.1, 0.3, 0.5, 0.7, and 0.9), and evaluation scores for all thresholds will be averaged.

Implementation Details.
For a fair comparison, we employ the same network architecture as the baseline approach. Specifcally, for the MNIST experiment, we use the U-Net architecture without pretraining as [23].Moreover, for the RIGA experiment, the main framework utilizes the U-Net architecture with ResNet34 as the backbone. Parameters of the U-Net encoder are initialized with the pretrained model on ImageNet [41]. Te abovementioned network is implemented with the PyTorch platform and trained/tested on a Tesla V100 GPU with 32 GB of memory. Te proposed network is trained end-to-end using the Adam optimizer [42], and it takes about 4 hours to train our model with a mini-batch size of 4 for 60 epochs. Te learning rate is set to 1 × 10 − 4 .

Evaluation Metrics.
Various evaluation metrics, including the Dice similarity coefcient (DSC) and mean intersection over union (mIoU), were utilized to evaluate the performance of the proposed method for segmenting OC and OD relative to GT. Tese performance metrics are defned as follows: where TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively, in the evaluation confusion matrix. Note that a model with higher metric values can predict more precise segmentation masks. All experimental results are reported as the average of the ten experiments conducted on the test set.

Performance of Our Methods.
Here, we provide a quantitative comparison among diferent loss functions including soft cross-entropy loss (CE), soft dice loss (DL), and soft focal loss [43] (FL). Table 1 displays the top fve combinations of loss functions with the highest D under the corresponding optimal hyperparameter settings, including the uncertainty threshold μ and the unsupervised loss weight λ 2 , where experiments are performed on the MNIST validation set. Te proposed method exhibits the best performance when both the supervision loss and the consistency regularization loss are CE. In Figure 4, we further present the comparison of model accuracy at diferent μ and λ 2 settings under this loss function combination on the MNIST and RIGA validation set. Moreover, the average unsupervised proportion corresponding to diferent μ in the two datasets is performed in Figure 4. By comparison, the optimal (μ, λ 2 ) combinations on the MNIST and RIGA datasets are (0.5, 0.005) and (0.5, 0.002), respectively, with corresponding unsupervised proportions of 8% and 4%.

Comparisons with Other Methods.
To demonstrate the advantage of the proposed method, we compare our method to the SOTA methods on the MNIST and RIGA datasets. We use the publicly released code with default parameters to retrain the SOTA methods with the same training/test set as ours for a fair comparison. Table 2 quantitatively compares our framework to three hard label methods, fve soft label methods, and other SOTA multirater labels' segmentation methods, including (a) Mode-UNet: UNet trained using a single label randomly selected; (b) MV-UNet: UNet trained using one-hot labels obtained by majority voting; (c) STAPLE-UNet: UNet trained using one-hot labels obtained by STAPLE [19]; (d) Average-UNet: UNet trained using soft labels obtained by average raters [31]; (e) GLS-UNet: UNet trained using soft labels smoothed by general label smoothing [44]; (f ) Sharpen-UNet: UNet trained using soft labels smoothed by label sharpen [45] under temperature (T) � 0.5 or 1.5; (g) Mixup-UNet: UNet trained using soft labels smoothed by Mixup; (h) SVLS-UNet: UNet trained using soft labels smoothed by SVLS [11]; (i) LNL [23]; (j) MRNet [24] on the MNIST and RIGA test sets.
As shown in Table 2, our proposed method consistently achieves superior performance compared with other methods. For value D, our method outperforms the SOTA method by 1.13% on the synthetic MNIST dataset. Figure 5 shows the visualization results, wherein our method recovers the most realistic result from several annotations containing obvious human errors. Additionally, compared to the suboptimal MRNet method, the segmentation results predicted by the proposed method are smoother and more structured at the edge. For the real-world dataset RIGA, the performance improvement is especially prominent for the retinal OC segmentation, where the inter-rater variability is more signifcant, with a 2.29% increase in D value over the current best method (listed in Table 2). Figure 6(a) visualizes fve examples of the silver standard and the corresponding segmentation results predicted by six diferent methods. As shown in Figure 6(b), the edge of OC occluded by blood vessels in the area with high inter-rater divergence (indicated by arrows) and, similarly, the highincidence area of misprediction (red areas) by other methods. Compared to other methods, the proposed method shows lower prediction errors in the aforementioned area, demonstrating the robustness of our method to difcult features.

Ablation Studies.
In this section, ablation studies are performed on the RIGA dataset over each component of the proposed method, including label smoothing (LS), consistency regularization (CR), and asymmetrical regularization (AR), as listed in Table 3. Meanwhile, the efect of diferent label smoothing techniques including GLS and Sharpen and SVLS on the performance of our method is also explored. Te baseline model is the UNet trained using soft labels of average raters. All experiments are performed with the same network structure and training hyperparameters, Sections 4.2 and 4.4. D ave represents the average value of D OC and D OD .
As shown in Table 3, the segmentation performance of the model reaches SOTA when all components are activated. As we sequentially remove the proposed components from the U-Net Baseline, the model performance degrades gradually. In particular, the inclusion of CR improved the    Computational Intelligence and Neuroscience baseline by 1.10%. Ten, the combination of CR and AR yielded an additional 0.55% improvement. Tis means that, in areas with higher rater inconsistency in annotations, the potential representation of image features is more reliable than in uncertain annotations. It is proved experimentally that features are also one of the important causes of observer variability rather than just the rater knowledge. In addition, adding SVLS alone improves D ave of baseline by 0.52% while utilizing it with CR and AR jointly improves D ave of the proposed method without SVLS by 1.05%. It demonstrates that the positive efects of CR and AR are further strengthened under the threshold of uncertainty with SVLS.

Generalization Capability.
To further verify the generalization capability of the proposed method, we additionally perform experiments on the kidney segmentation task of the QUBIQ multirater segmentation challenge. We use the same  and I (soft) as QUBIQ challenge, which can better evaluate the ability of the model to refect potential inter-rater agreement/disagreement. Specifcally, after the GTand prediction are binarized at multiple threshold levels (0.1, 0.3, 0.5, 0.7, and 0.9), the D and I metrics averaged across fve thresholds are D (soft) and I (soft) . As listed in Table 4, compared to the comparative methods, the proposed method achieves optimal performance on the QUBIQ-Prostate dataset and achieves suboptimal performance on the QUBIQ-Kidney dataset. Furthermore, the advantage of a low number of parameters facilitates the application of our method to other multirater datasets. Several representative examples of the comparison methods for such two datasets are visualized in Figure 7.

Conclusion
In this article, we focus on the utilization of rich annotation information from multiple clinical raters, which are relatively less explored but widely presented in medical image segmentation. Based on the deep learning method using soft labels, we proposed a local self-ensembling learning model related to pixelwise variance with the intention of reducing the reliance upon uncertain local labels and optimizing the feature space. Our method achieves performance improvement over the soft labels' learning method without requiring the introduction of extra parameters and structures. In addition, we incorporate structural uncertainty into soft labels via the label smoothing technique to further improve segmentation performance level. Empirical experiments demonstrated the overall superior performance of our method on a synthetic dataset and a real-world dataset.
Our method provides a solution for automatically learning a reliable clinical-aided diagnosis system using multirater annotations.

Data Availability
Te data used to support the fndings of this study are available from the corresponding author upon request.

Conflicts of Interest
Te authors declare that they have no conficts of interest.  Computational Intelligence and Neuroscience 9